
    Latency-Rate servers & Dataflow models


    Stability Verification of Self-Timed Control Systems using Model-Checking


    Interference control by best-effort process duty-cycling in chip multi-processor systems for real-time medical image processing

    Systems with chip multi-processors are currently used for several applications that have real-time requirements. In chip multi-processor architectures, many hardware resources, such as parts of the cache hierarchy, are shared between cores, and applications that use such resources can interfere with each other significantly. In previous work, we showed that a single X-ray imaging streaming application can be executed with low jitter on such systems. However, it was assumed that only one application would be running on the system, which prevents system integration where multiple real-time and best-effort applications execute on a single chip multi-processor. In this paper, we address the limited bandwidth in the cache hierarchy, which can cause threads to interfere with each other significantly. We propose a technique that implements cache bandwidth reservation in software by dynamically duty-cycling best-effort applications, based on their cache bandwidth usage measured with processor performance counters, in order to control the influence of best-effort applications on real-time applications. With this technique we can control the latency increase of real-time applications caused by best-effort applications, so that real-time requirements are satisfied with a minimal reduction in best-effort performance. The results of experiments with real-life applications indicate that we can control the increase in latency to such an extent that we can almost completely eliminate the influence of bandwidth sharing in the cache, at the cost of best-effort performance.
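    The paper itself gives no code; as a rough, hypothetical sketch of the duty-cycling idea, the C loop below pauses or resumes a best-effort task depending on whether a measured cache-bandwidth figure exceeds a budget. The counter read is a random stand-in for a real performance-counter interface (e.g. perf events), and the budget and interval values are invented.

```c
/* Illustrative duty-cycling loop: pause a best-effort task whenever the
 * measured cache bandwidth exceeds a budget. The bandwidth reading is a
 * random stand-in for a real performance-counter read (e.g. LLC misses);
 * the budget and interval are arbitrary example values. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Stand-in for reading a cache-bandwidth-related counter (bytes per interval). */
static long read_cache_bandwidth(void)
{
    return rand() % 2000000;              /* pretend: 0..2 MB per interval */
}

int main(void)
{
    const long budget_bytes = 1000000;    /* allowed cache traffic per interval */
    const useconds_t interval_us = 10000; /* 10 ms control interval */
    int running = 1;                      /* duty-cycle state of the best-effort task */

    for (int i = 0; i < 100; i++) {
        long used = read_cache_bandwidth();
        if (used > budget_bytes && running) {
            running = 0;                  /* over budget: pause best-effort work */
            printf("interval %3d: pause  (used %ld bytes)\n", i, used);
        } else if (used <= budget_bytes && !running) {
            running = 1;                  /* under budget again: resume */
            printf("interval %3d: resume (used %ld bytes)\n", i, used);
        }
        usleep(interval_us);
    }
    return 0;
}
```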

    Enabling application-level performance guarantees in network-based systems on chip by applying dataflow analysis

    A growing number of applications, often with real-time requirements, are integrated on the same system on chip (SoC) in the form of hardware and software intellectual property (IP). To facilitate real-time applications, networks on chip (NoC) guarantee bounds on latency and throughput. These bounds, however, only extend to the network interfaces (NI) between the IP and the NoC. To give performance guarantees at the application level, the buffers in the NIs must be sufficiently large for the particular application. At the same time, it is imperative to minimise the size of the NI buffers, as they are major contributors to the area and power consumption of the NoC. Existing buffer-sizing methods use coarse-grained application models, based on linear traffic bounds or periodic producers and consumers, thus severely limiting their applicability. In this work, the authors propose to capture the behaviour of the NoC and the applications using a dataflow model. This enables them to verify the temporal behaviour and to compute buffer sizes using existing dataflow analysis techniques. The authors show what is required from the NoC architecture and demonstrate how to construct an NoC model with multiple levels of detail. Using the proposed model, buffer sizes are determined for a range of SoC designs, with a run time comparable to existing analytical methods and results comparable to exhaustive simulation. For an application case study, where existing buffer-sizing methods are not applicable, the proposed model enables verification of the end-to-end temporal behaviour.
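    As a hypothetical illustration of how a temporal model turns into a buffer size (not the authors' NoC model), the sketch below treats a producer/consumer pair as a two-actor dataflow cycle in which the buffer capacity appears as initial tokens on the back edge; the smallest capacity that meets a required throughput then follows from the cycle's token count and actor times. All numbers are invented.

```c
/* Illustrative dataflow-style buffer sizing: a producer/consumer pair is
 * modelled as a two-actor cycle in which the buffer capacity d appears as
 * d initial tokens on the back edge. The cycle limits throughput to
 * d / (tp + tc), so the smallest buffer meeting a required throughput f
 * is ceil(f * (tp + tc)). Actor times and the required rate are examples. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double tp = 6.0;        /* producer firing time (cycles per token) */
    double tc = 9.0;        /* consumer firing time (cycles per token) */
    double f_req = 0.10;    /* required throughput (tokens per cycle)  */

    /* Throughput can never exceed the slowest actor on its own. */
    double f_max = 1.0 / (tp > tc ? tp : tc);
    if (f_req > f_max) {
        printf("infeasible: required %.3f > actor bound %.3f\n", f_req, f_max);
        return 1;
    }
    int d = (int)ceil(f_req * (tp + tc));   /* minimal buffer (tokens) */
    printf("minimal buffer: %d tokens, giving throughput %.3f tokens/cycle\n",
           d, d / (tp + tc));
    return 0;
}
```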

    Analytical approaches for performance evaluation of networks-on-chip

    This tutorial reviews four popular mathematical formalisms (dataflow analysis, schedulability analysis, network calculus, and queueing theory) and how they have been applied to the analysis of Network-on-Chip (NoC) performance. We review the basic concepts and results of each formalism and provide examples of how they have been used in on-chip communication performance analysis. The tutorial also discusses the respective strengths and weaknesses of each formalism, their suitability for specific purposes, and the attempts that have been made to bridge these analytical approaches. Finally, we conclude the tutorial by discussing open research issues.
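    As a small worked example for one of the reviewed formalisms (network calculus), the sketch below evaluates the standard delay bound for a token-bucket-constrained flow served by a rate-latency server; the parameter values are invented and not tied to any particular NoC.

```c
/* Worked network-calculus example: a flow with token-bucket arrival curve
 * alpha(t) = sigma + r*t crossing a rate-latency server
 * beta(t) = rho * max(0, t - theta) has delay bound theta + sigma/rho,
 * provided rho >= r. Parameter values are arbitrary illustrations. */
#include <stdio.h>

int main(void)
{
    double sigma = 32.0;   /* burst (flits)              */
    double r     = 0.2;    /* sustained rate (flits/cyc) */
    double theta = 50.0;   /* service latency (cycles)   */
    double rho   = 0.4;    /* service rate (flits/cyc)   */

    if (rho < r) {
        printf("no finite bound: service rate below arrival rate\n");
        return 1;
    }
    printf("delay bound: %.1f cycles\n", theta + sigma / rho);
    return 0;
}
```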

    Low-cost Guaranteed-Throughput dual-ring communication infrastructure for heterogeneous MPSoCs

    Connection-oriented Guaranteed-Throughput (GT) mesh-based Networks on Chip (NoCs) have been proposed as a replacement for buses in real-time stream processing systems, but they are currently rarely used because their hardware cost tends to be higher than that of conventional interconnects. Recently, an interconnect with a ring topology was introduced as a low-cost alternative for use in medium-scale homogeneous Multiple Processor System on Chip (MPSoC) designs. Cost-effective integration of stream processing accelerators would require an extension of this ring interconnect. We present a dual-ring communication infrastructure for heterogeneous MPSoC designs. Data and credits are transferred between tiles using two separate, oppositely directed rings. The minimum throughput is determined by analysis of a Cyclo-Static Data Flow (CSDF) model of a system with communication between accelerators and processors. The performance benefits and costs are evaluated by integrating our dual ring and an accelerator in a 16-core MPSoC mapped on a Virtex6 FPGA, on which a real-time PAL video decoder is executed. A performance gain of a factor of 3.6 was obtained at an increase in hardware cost of only 8.5%.
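    As a back-of-the-envelope illustration of why the credit return path bounds throughput on such a dual ring (a simplification of the CSDF analysis in the paper), the sketch below computes the credit-limited rate from the data-ring and credit-ring hop counts; hop counts, hop latency, and credit count are invented.

```c
/* Illustrative credit-based throughput bound for a dual-ring interconnect:
 * with C buffer credits and a round trip of data-ring hops plus credit-ring
 * hops (each hop taking a fixed number of cycles), the sustained rate cannot
 * exceed C words per round trip. All figures here are invented examples. */
#include <stdio.h>

int main(void)
{
    int credits = 8;       /* buffer credits at the receiving tile     */
    int hops_d  = 10;      /* data-ring hops from producer to consumer */
    int hops_c  = 6;       /* credit-ring hops back to the producer    */
    int cyc_hop = 2;       /* cycles per hop                           */

    int round_trip = (hops_d + hops_c) * cyc_hop;
    double rate = (double)credits / round_trip;   /* words per cycle */
    printf("round trip: %d cycles, throughput bound: %.3f words/cycle\n",
           round_trip, rate);
    return 0;
}
```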

    Portable Memory Consistency for Software Managed Distributed Memory in Many-Core SoC

    Porting software to different platforms can require modifications of the application. One of the issues is that the targeted hardware may support a different memory consistency model. As a consequence, the completion order of reads and writes in a multi-threaded application can change, which may result in improper synchronization. For example, a processor with out-of-order execution could break synchronization if proper fence instructions are missing. Such a bug can cause sporadic errors, which are hard to debug. This paper presents an approach that makes applications independent of the memory model of the hardware, so that they can be compiled for hardware with any memory architecture. The key is a memory model that only guarantees the most fundamental orderings of reads and writes, together with annotations to specify additional ordering constraints. As a result, tooling can transparently and correctly implement fences, cache flushes, etc. where appropriate, without losing flexibility in the hardware design. In a case study, several SPLASH-2 applications are run on a 32-core software cache-coherent MicroBlaze system on an FPGA. Moreover, the approach also allows mapping to scratch-pad memories and a distributed shared memory architecture.
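    The paper defines its own annotations; as a generic illustration of the same principle, namely stating ordering constraints explicitly instead of relying on the hardware's memory model, the C11 sketch below marks a producer/consumer handover with release/acquire atomics, which the toolchain then maps onto whatever fences the target needs.

```c
/* Generic illustration (not the paper's annotation scheme): the required
 * ordering is made explicit with C11 release/acquire atomics, so the
 * toolchain inserts whatever fences the target memory model requires. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload;                       /* plain shared data    */
static atomic_int ready = 0;              /* synchronisation flag */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                         /* write the data first       */
    atomic_store_explicit(&ready, 1,      /* then publish it: release   */
                          memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready,  /* wait for publication: acquire */
                                 memory_order_acquire))
        ;                                 /* spin */
    printf("payload = %d\n", payload);    /* guaranteed to observe 42 */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```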

    Evaluation of scheduling heuristics for jitter reduction of real-time streaming applications on multi-core general purpose hardware

    The real-time systems research community has paid a lot of attention to the design of safety-critical hard real-time systems, for which the use of non-standard hardware and operating systems can be justified. However, stream processing applications like medical imaging systems are often not considered safety-critical enough to justify the use of hard real-time techniques that would increase the cost of these systems significantly. Instead, commercial off-the-shelf (COTS) hardware and operating systems are used, and techniques at the application level are employed to reduce the variation in the end-to-end latency of these image processing systems. In this paper, we study the effectiveness of a number of scheduling heuristics that are intended to reduce the latency and the jitter of stream processing applications executed on COTS multiprocessor systems. The proposed scheduling heuristics take into account the execution times of tasks, the dependencies between tasks, the data structures accessed by the tasks, and the memory hierarchy. Experiments were carried out on a quad-core symmetric multiprocessing (SMP) Intel processor. These experiments show that the proposed heuristics can reduce the end-to-end latency by almost 60%, and reduce the variation in the latency by more than 90%, compared with a naive scheduling heuristic that does not consider execution times, dependencies, and the memory hierarchy.
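    The concrete heuristics are described in the paper; the sketch below only shows the general shape of such a scheduler: a list scheduler that respects task dependencies and places each task, using its execution time, on the core that becomes available earliest. The task graph is invented, and the cache/memory-hierarchy awareness of the paper's heuristics is not modelled.

```c
/* Sketch of a dependency-aware list scheduler: tasks become ready once their
 * predecessor finishes and are placed on the core that becomes available
 * earliest. The task graph and execution times are invented examples. */
#include <stdio.h>

#define NTASKS 5
#define NCORES 4

int main(void)
{
    /* exec[i]: execution time of task i; dep[i]: index of its single
     * predecessor, or -1 if it has none (kept deliberately simple). */
    int exec[NTASKS] = { 4, 3, 2, 5, 1 };
    int dep[NTASKS]  = { -1, 0, 0, 1, 2 };

    int core_free[NCORES] = { 0 };   /* time at which each core becomes idle */
    int finish[NTASKS]    = { 0 };   /* finish time of each task             */

    for (int i = 0; i < NTASKS; i++) {   /* tasks listed in topological order */
        int ready = (dep[i] < 0) ? 0 : finish[dep[i]];

        int best = 0;                    /* pick the earliest-available core */
        for (int c = 1; c < NCORES; c++)
            if (core_free[c] < core_free[best])
                best = c;

        int start = ready > core_free[best] ? ready : core_free[best];
        finish[i] = start + exec[i];
        core_free[best] = finish[i];
        printf("task %d on core %d: start %d, finish %d\n",
               i, best, start, finish[i]);
    }
    return 0;
}
```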